Synthetic population generation for Beijing¶

The demo shows how you can generate the synthetic population of China using SyntheticPopulation.jl package.

1. Include SyntheticPopulation package¶

In [1]:
include("../src/SyntheticPopulation.jl")
using PyCall
using Colors
folium = pyimport("folium")
Out[1]:
PyObject <module 'folium' from '/Users/marcinzurek/.julia/conda/3/x86_64/lib/python3.10/site-packages/folium/__init__.py'>

2. Provide marginal data for individuals and households¶

For the purpose of this demo, we are using data from:

  • Beijing Statistical Yearbook: https://nj.tjj.beijing.gov.cn/nj/main/2021-tjnj/zk/indexeh.htm
  • China Statistical Yearbook: http://www.stats.gov.cn/sj/ndsj/2019/indexeh.htm

The data is input manually in a form of Julia DataFrame. We are using the following tables:

  • For attributes of individuals:
    • 3-6 PERMANENT POPULATION BY AGE COMPOSITION (2020)
    • 2-13 Population by Sex, Marital Status and Region (2018)
    • 5-4 PER CAPITA INCOME AND CONSUMPTION EXPENDITURE OF HOUSEHOLDS OF THE WHOLE CITY (2015-2020)
  • For attributes of households:
    • 3-8 FAMILY SIZE OF PERMANENT POPULATION (2020)

Because of large population of Beijing, the data is scales by 0.1% to decrease computational costs.

Each of dataframes provides information about the distribution of population. We can only access marginal distributions of attributes (e.g. population by age and sex).

In [2]:
#each individual and each household represent 1000 individuals or households
SCALE = 0.001
Out[2]:
0.001
In [3]:
#all values are based on China census data
popoulation_size = 21890000

#individuals
marginal_ind_age_sex = DataFrame(
    sex = repeat(['M', 'F'], 18),
    age = repeat(2:5:87, inner = 2), 
    population = SCALE .* 10000 .* [52.6, 49.0, 48.5, 44.8, 33.6, 30.6, 34.6, 28.8, 71.6, 63.4, 99.6, 90.9, 130.9, 119.4, 110.8, 103.5, 83.8, 76.4, 84.2, 77.7, 84.2, 77.8, 82.8, 79.9, 67.7, 71.0, 56.9, 62.6, 31.5, 35.3, 18.5, 23.0, 15.2, 19.7, 12.5, 16.0]
    )


marginal_ind_sex_maritialstatus = DataFrame(
    sex = repeat(['M', 'F'], 4), 
    maritialstatus = repeat(["Not_married", "Married", "Divorced", "Widowed"], inner = 2), 
    population = SCALE .* [1679, 1611, 5859, 5774, 140, 206, 128, 426] ./ 0.00082
    )


marginal_ind_income = DataFrame(
    income = [25394, 44855, 63969, 88026, 145915], 
    population = repeat([popoulation_size * SCALE / 5], 5)
    )

#households
marginal_hh_size = DataFrame(
    hh_size = [1,2,3,4,5],
    population = Int.(round.(SCALE * 8230000 .* [0.299, 0.331, 0.217, 0.09, 0.063]))
    )
Out[3]:
5×2 DataFrame
Rowhh_sizepopulation
Int64Int64
112461
222724
331786
44741
55518

3. Generate data about population in Beijing's districts.¶

Next, we generate a Julia DataFrame which presents population breakdown per each district. This consists of 2 steps:

  1. First, we download the distrct boundaries from https://osm-boundaries.com/Map and save into Julia DataFrame.
  2. Then, we manually add a column with population for each district, based on Beijing Statistical Yearbook (table 3-4 TOTAL NUMBER AND DENSITY OF PERMANENT POPULATION (BY DISTRICT) (2020))
In [4]:
#areas
URL = "https://osm-boundaries.com/Download/Submit?apiKey=87100809b4085adb58139419c141e5a1&db=osm20230102&osmIds=-2988894,-2988933,-2988895,-288600,-2988896,-2988946,-5505984,-2988897,-2988898,-2988899,-2988900,-5505985,-2988901,-2988902,-568660,-2988903&format=GeoJSON&srid=4326"
areas = generate_areas_dataframe(URL)

#aggregated_areas - population referenced from https://nj.tjj.beijing.gov.cn/nj/main/2021-tjnj/zk/indexeh.htm
aggregated_areas = copy(areas)
aggregated_areas.:population = SCALE .* 10000 .* [56.8, 313.2, 201.9, 345.1, 34.6, 184.0, 132.4, 45.7, 52.8, 39.3, 44.1, 131.3, 199.4, 226.9, 110.6, 70.9]
aggregated_areas = aggregated_areas[:, [:id, :population]]
areas
Downloading file... 
File downloaded. Unzipping file...
File saved at /Users/marcinzurek/Desktop/Studia/Research/SyntheticPopulation/notebooks/file.geojson
Out[4]:
16×3 DataFrame
Rowidgeometryname
Int64GeometryString
11PolygonShijingshan District
22PolygonHaidian District
33PolygonFengtai District
44MultiPolygonChaoyang District
55PolygonYanqing District
66PolygonTongzhou District
77PolygonShunyi District
88PolygonPinggu District
99PolygonMiyun District
1010PolygonMentougou District
1111PolygonHuairou District
1212PolygonFangshan District
1313PolygonDaxing District
1414PolygonChangping District
1515PolygonXicheng District
1616PolygonDongcheng District

4. Merge marginal distributions to obtain joint distributions for multiple attributes¶

As the next step, we merge marginal distributions of attributes of individuals to obtain join distributions for attribute combinations. This is done using a recursive algorithm that leverages Iterative Proportional Fitting Method proposed by Guo, Bhat, 2007 [1]. In addition, our algorithm is improved because it can be configured with a JSON file to provide more flexibility. This approach is insipired by Ponge, Enbergs et al. 2021. The configuration documentation is described elsewhere.

[1] Guo, J. Y., & Bhat, C. R. (2007). Population synthesis for microsimulating travel behavior. Transportation Research Record, 2014(1), 92-101.
[2] Ponge, J., Enbergs, M., Schüngel, M., Hellingrath, B., Karch, A., & Ludwig, S. (2021, December). Generating synthetic populations based on german census data. In 2021 Winter Simulation Conference (WSC) (pp. 1-12). IEEE.

In [10]:
individuals, aggregated_individuals = generate_joint_distributions(marginal_ind_sex_maritialstatus, marginal_ind_age_sex, marginal_ind_income, config_file = "config_file.json")
[ Info: Inconsistent target margins, converting `X` and `mar` to proportions. Margin totals: [9520, 9502]
[ Info: Converged in 1 iterations.
[ Info: Inconsistent target margins, converting `X` and `mar` to proportions. Margin totals: [9777, 9166]
[ Info: Converged in 1 iterations.
[ Info: Inconsistent target margins, converting `X` and `mar` to proportions. Margin totals: [18668, 21890]
[ Info: Converged in 1 iterations.
Out[10]:
(568×5 DataFrame
 Row │ id     maritialstatus  sex    age     income  
     │ Int64  String?         Char?  Int64?  Int64?  
─────┼───────────────────────────────────────────────
   1 │     1  Not_married     M          22    25394
   2 │     2  Married         M          22    25394
   3 │     3  Divorced        M          22    25394
   4 │     4  Widowed         M          22    25394
   5 │     5  Not_married     F          22    25394
   6 │     6  Married         F          22    25394
   7 │     7  Divorced        F          22    25394
   8 │     8  Widowed         F          22    25394
   9 │     9  Not_married     M          27    25394
  10 │    10  Married         M          27    25394
  11 │    11  Divorced        M          27    25394
  ⋮  │   ⋮          ⋮           ⋮      ⋮        ⋮
 559 │   559  Divorced        F          87   145915
 560 │   560  Widowed         F          87   145915
 561 │   561  missing         F           2  missing 
 562 │   562  missing         F           7  missing 
 563 │   563  missing         F          12  missing 
 564 │   564  missing         F          17  missing 
 565 │   565  missing         M           2  missing 
 566 │   566  missing         M           7  missing 
 567 │   567  missing         M          12  missing 
 568 │   568  missing         M          17  missing 
                                     547 rows omitted, 568×2 DataFrame
 Row │ id     population 
     │ Int64  Int64      
─────┼───────────────────
   1 │     1          31
   2 │     2         107
   3 │     3           3
   4 │     4           2
   5 │     5          43
   6 │     6         150
   7 │     7           4
   8 │     8           3
   9 │     9          56
  10 │    10         196
  11 │    11           5
  ⋮  │   ⋮        ⋮
 559 │   559           1
 560 │   560           2
 561 │   561         490
 562 │   562         448
 563 │   563         306
 564 │   564         288
 565 │   565         526
 566 │   566         485
 567 │   567         336
 568 │   568         346
         547 rows omitted)

We do the same for a dataframe with households to generate households pool.

In [11]:
households, aggregated_households = generate_joint_distributions(marginal_hh_size)
Out[11]:
(5×2 DataFrame
 Row │ id     hh_size 
     │ Int64  Int64   
─────┼────────────────
   1 │     1        1
   2 │     2        2
   3 │     3        3
   4 │     4        4
   5 │     5        5, 5×2 DataFrame
 Row │ id     population 
     │ Int64  Int64      
─────┼───────────────────
   1 │     1        2461
   2 │     2        2724
   3 │     3        1786
   4 │     4         741
   5 │     5         518)

5. Assign individuals to households¶

Once the individual pool and household pool is generated, we assign individuals to households. The individuals are assigned to households based on the following rules:

  1. Each household consists of household head.
  2. If household size is 2, then we draw a partner of opposite gender aged +/- 5 years from household head.
  3. If houhsehold size is 3-5, we fill the rest of household spots with childredn. Age difference between parents and child is between 15 and 40.
In [13]:
disaggregated_households, unassigned = assign_individuals_to_households(individuals, aggregated_individuals, households, aggregated_households, return_unassigned = true)
Total_number of individuals: 21885
Total_number of households: 8230
Allocation started... 
Progress: 100%|█████████████████████████████████████████| Time: 0:00:06
Allocated 86.0% individuals.
Allocated 100.0% households.
Out[13]:
(8230×9 DataFrame
  Row │ id     hh_attr_id  hh_size  individuals_allocated  head_id  partner_id ⋯
      │ Int64  Int64       Int64    Bool                   Int64    Int64      ⋯
──────┼─────────────────────────────────────────────────────────────────────────
    1 │     1           1        1                   true      489           0 ⋯
    2 │     2           1        1                   true      493           0
    3 │     3           1        1                   true      233           0
    4 │     4           1        1                   true      152           0
    5 │     5           1        1                   true      287           0 ⋯
    6 │     6           1        1                   true      412           0
    7 │     7           1        1                   true      457           0
    8 │     8           1        1                   true      532           0
    9 │     9           1        1                   true      245           0 ⋯
   10 │    10           1        1                   true      237           0
   11 │    11           1        1                   true       75           0
  ⋮   │   ⋮        ⋮          ⋮               ⋮               ⋮         ⋮      ⋱
 8221 │  8221           5        5                   true      394         222
 8222 │  8222           5        5                   true      290         246 ⋯
 8223 │  8223           5        5                   true      474          78
 8224 │  8224           5        5                   true       62         410
 8225 │  8225           5        5                   true       10         326
 8226 │  8226           5        5                   true       70         282 ⋯
 8227 │  8227           5        5                   true      346         254
 8228 │  8228           5        5                   true      242         286
 8229 │  8229           5        5                   true      286         306
 8230 │  8230           5        5                   true      234         398 ⋯
                                                 3 columns and 8209 rows omitted, Dict{String, DataFrame}("unassigned_individuals" => 471×2 DataFrame
 Row │ id     population 
     │ Int64  Int64      
─────┼───────────────────
   1 │     1           3
   2 │     5           1
   3 │     6           3
   4 │     9          21
   5 │    10           8
   6 │    11           2
   7 │    12           3
   8 │    13          23
   9 │    14           3
  10 │    15           2
  11 │    16           1
  ⋮  │   ⋮        ⋮
 462 │   557           2
 463 │   558           1
 464 │   561          16
 465 │   562          12
 466 │   563           8
 467 │   564           6
 468 │   565          12
 469 │   566           8
 470 │   567           6
 471 │   568          12
         450 rows omitted, "unassigned_households" => 0×2 DataFrame
 Row │ id     population 
     │ Int64  Int64      
─────┴───────────────────))

6. Assigning geographical coordinates for households.¶

Once each household is filled with individuals of certain type, we assign coordinates for each household. This is done in 2 steps:

  1. We draw a district for each of the households.
  2. We draw random coordinates that are within the polygon that represents district area.
In [14]:
disaggregated_households, unassigned = assign_areas_to_households!(disaggregated_households, areas, aggregated_areas, return_unassigned = true)
disaggregated_households
===================
Assigning coordinates to households...
===================
Progress: 100%|█████████████████████████████████████████| Time: 0:00:29
Out[14]:
8230×11 DataFrame
8205 rows omitted
Rowidhh_attr_idindividuals_allocatedhead_idpartner_idchild1_idchild2_idchild3_idlatlonarea_id
Int64Int64BoolInt64Int64Int64Int64Int64Float64Float64Int64
111true489000039.8338116.213
221true493000039.8864116.7366
331true233000039.8994116.5214
441true152000039.9508116.3422
551true287000039.6194115.58212
661true412000039.7954116.4233
771true457000039.5492116.36113
881true532000039.6505116.8436
991true245000039.8015116.3533
10101true237000039.7172116.7136
11111true75000039.947116.2792
12121true401000040.2152116.6997
13131true9000040.266116.33414
⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮
821982195true234310134345439.8441116.5254
822082205true482350568342639.9024116.4116
822182215true39422256745456140.0025116.2622
822282225true29024656156823039.8841116.33415
822382235true4747856656356440.0491116.3122
822482245true6241033845456239.8415116.3913
822582255true1032623011856839.619116.4313
822682265true7028256156756239.6206116.24313
822782275true34625456456856240.0993116.1322
822882285true24228656756556140.1734116.28714
822982295true28630656356333840.0472116.2362
823082305true23439856556656740.0092116.3854

7. Visualisation¶

In the dataframe disaggregated_households each row represents one household. For each household we assigned some household members, each of them being characterized by a set of attributes. Also, each household has coordinates that show the location of the household.

Let's visualise it:

In [15]:
m = folium.Map(location = [disaggregated_households.lat[1], disaggregated_households.lon[1]], zoom_start=11)
i = 1
for area in unique(disaggregated_households.area_id)
    colrs = distinguishable_colors(length(unique(disaggregated_households.area_id)), [RGB(1,0.6,0.5)])
    hh_color = "#$(hex(colrs[i]))"
    i += 1
    area  = filter(row -> row.area_id == area, disaggregated_households)
    for i in 1:nrow(area)
        folium.Circle(
            location = (area.lat[i], area.lon[i]),
            radius = 100,
            color = hh_color,
            fill = false,
            fill_color = hh_color
        ).add_to(m)
    end
end
m
Out[15]:
Make this Notebook Trusted to load map: File -> Trust Notebook